Session 5 : Enrichment analysis of Differentially Expressed Genes

Advanced R-course 2025

Dr. Debasish Mukherjee, Dr. Ulrike Goebel, Dr. Ali Abdallah

Bioinformatics Core Facility CECAD

2025-11-21

Slides & Code

git repo

R-Advanced


Clone repo

git clone https://github.com/CECADBioinformaticsCoreFacility/Advanced_R_course_2025.git


Slides Directly

https://cecadbioinformaticscorefacility.github.io/Advanced_R_course_2025/

Enrichment Analysis

What to do with the DE genes?

Are these genes associated with particular biological functions, pathways, or processes more than we’d expect by chance?

Advantages:

  • Biological insight
  • Validation of experiment
  • Generate new hypotheses

Limitations:

  • You can only discover what is already known
  • Novel functionality will be missing
  • Existing annotations may be incorrect
  • Many species are poorly supported

Main Approaches

  • Over-representation analysis (ORA)
  • Functional Class Scoring (FCS), e.g. Gene Set Enrichment Analysis (GSEA)
  • Pathway Topology-based methods

Categorization of genes

Regrouping genes together in meaningful sets based on:

  • involvement in a pathway
  • presence in a specific cellular location
  • shared molecular function
  • other categorizations …

Note

To check if any categories are over-represented, compare the proportion of genes in each category in your gene list to the proportion in the background set and calculate how likely it is to see your observed proportion by chance.

Categorized gene sets : Databases

Human curated:

  • Gene Ontology (GO)
  • Biological Pathways (Reactome, KEGG, WikiPathways)

Domains / Patterns:

  • Protein functional domains (SMART, InterPro)
  • Transcription factor regulated (AnimalTFDB, JASPAR)

Experimental:

  • Co-expressed genes (GeneFriends, COXPRESdb)
  • Interactions (STRING, BioGRID)
  • Hits from other studies

Molecular Signatures Database (MSigDB) is a popular resource that compiles many gene sets from various sources.

Gene Ontology (GO)

  • GO provides a controlled vocabulary to describe gene functions.

  • Three main domains:

    • Biological Process (BP) – pathways and larger processes
    • Molecular Function (MF) – biochemical activities
    • Cellular Component (CC) – where functions occur

GO as a Directed Acyclic Graph (DAG)

  • GO terms form a hierarchical structure.

  • Each GO term may have:

    • Parent terms (more general)
    • Child terms (more specific)
  • Relationships include:

    • is_a: subclass
    • part_of: component of a larger process
  • Hierarchy is not strictly a tree; terms can have multiple parents.

Implications of the GO DAG

  • A gene annotated to a specific child term is also annotated to all its parent terms.
  • GO enrichment results often show clusters of related terms.
  • Understanding the hierarchy helps interpret redundancy.
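
Annotation propagation can be sketched in base R on a toy DAG. The term and gene names below are invented for illustration and are not real GO terms; the helper `ancestors()` is likewise hypothetical.

```r
# Toy GO-like DAG: each term lists its direct parents (all names are made up).
parents <- list(
  root       = character(0),
  metabolism = "root",
  signaling  = "root",
  lipid_met  = "metabolism",
  lipid_sig  = c("lipid_met", "signaling")  # two parents: a DAG, not a tree
)

# Collect all ancestors of a term by walking up the parent links.
ancestors <- function(term) {
  direct <- parents[[term]]
  unique(c(direct, unlist(lapply(direct, ancestors))))
}

# Direct (most specific) annotation of three hypothetical genes.
direct_annot <- list(geneA = "lipid_sig", geneB = "lipid_met", geneC = "signaling")

# Propagate: a gene belongs to its direct term and to every ancestor of it.
propagated <- lapply(direct_annot, function(term) c(term, ancestors(term)))

# Invert to term -> genes to see how gene sets grow towards the root.
term_genes <- split(rep(names(propagated), lengths(propagated)),
                    unlist(propagated))
lengths(term_genes)
```

In this sketch the root term ends up containing every gene, while the most specific term contains only one. This is why broad terms tend to be large and why related terms share genes.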

How GO Hierarchy Affects ORA

Annotation Propagation

  • Because children inherit from parents:

    • Broad parent terms often have many genes, reducing specificity.
    • Specific child terms may give more targeted insights.

Redundancy in Enrichment Results

  • ORA may flag clusters of related terms.
  • This is due to overlapping gene sets across the DAG.
  • Redundant terms can complicate interpretation.

Strategies to Manage Redundancy

  • Use tools that:

    • Summarize GO terms (e.g., REVIGO)
    • Prune the DAG (e.g., parent-child approaches)
    • Cluster enriched terms based on similarity
  • Focus on most specific significant terms for clearer insights.

  • Tools such as topGO, ClueGO, GO-Bayes, and GO-Elite help address hierarchy issues.
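
One simple way to cluster or prune overlapping terms is to compare their gene sets directly. The sketch below uses Jaccard overlap on invented terms, genes, and p-values, with an arbitrary cutoff of 0.7. It is a simplification: tools like REVIGO rely on semantic similarity between GO terms rather than plain gene overlap, and clusterProfiler offers a `simplify()` function for `enrichGO` results.

```r
# Hypothetical enriched terms with their DE genes and adjusted p-values.
term_genes <- list(
  "lipid metabolic process"    = c("g1", "g2", "g3", "g4", "g5"),
  "lipid biosynthetic process" = c("g1", "g2", "g3", "g4"),
  "immune response"            = c("g6", "g7", "g8")
)
padj <- c(0.001, 0.004, 0.010)  # same order as term_genes

# Jaccard index: shared genes divided by all genes in either set.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Greedy filter: walk through terms from most to least significant and keep
# a term only if it is not too similar to any term already kept.
cutoff <- 0.7  # arbitrary similarity threshold for this sketch
kept <- character(0)
for (term in names(term_genes)[order(padj)]) {
  sims <- vapply(kept,
                 function(k) jaccard(term_genes[[term]], term_genes[[k]]),
                 numeric(1))
  if (all(sims < cutoff)) kept <- c(kept, term)
}
kept
```

Here the two lipid terms share four of five genes, so only the more significant one is retained alongside the unrelated immune term.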

Key Takeaways

  • GO is a hierarchical DAG, not a simple tree.
  • Term relationships propagate annotations upward.
  • ORA enrichment results depend strongly on GO hierarchy.
  • Awareness of hierarchy helps with better interpretation and reduction of redundancy.

Over-representation analysis (ORA) Steps …

  1. Define a list of DE genes (e.g., based on p-value and fold-change thresholds)
  2. For each gene set (category), count how many DE genes are in the set
  3. Compare this to the expected number based on the background set
  4. Use statistical tests (e.g., Fisher’s exact test, hypergeometric test) to determine if the overlap is greater than expected by chance
  5. Adjust for multiple testing (e.g., Benjamini-Hochberg correction)
  6. Interpret results to identify significantly enriched categories

Categories   Organism background   DE results   Over-represented?
Category 1   59/20000              45/2000      Yes/No
Category 2   150/20000             30/2000      Yes/No
Category 3   300/20000             10/2000      Yes/No
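
Steps 2 to 5 can be sketched in base R for the three example categories, using the hypergeometric test and Benjamini-Hochberg correction named in the steps. The counts are taken from the example table; the 0.05 cutoff and the column names are choices made for this sketch only.

```r
# Counts taken from the example table.
N <- 20000                      # genes in the background
M <- 2000                       # DE genes
k <- c(59, 150, 300)            # category sizes in the background
n <- c(45, 30, 10)              # DE genes found in each category
names(k) <- names(n) <- paste("Category", 1:3)

# Number of DE genes expected in each category by chance.
expected <- k * M / N

# Upper-tail hypergeometric p-value: P(X >= n) for each category.
p_value <- phyper(n - 1, k, N - k, M, lower.tail = FALSE)

# Benjamini-Hochberg adjustment across the tested categories.
p_adj <- p.adjust(p_value, method = "BH")

data.frame(in_category = k, observed = n, expected = expected,
           p_value = p_value, p_adj = p_adj,
           overrepresented = p_adj < 0.05)
```

Categories 1 and 2 contain more DE genes than expected (45 vs. 5.9 and 30 vs. 15) and pass the cutoff. Category 3 contains fewer than expected (10 vs. 30), so it is not over-represented.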

Hypergeometric testing

Symbol   Description
\(N\)    Total number of genes (universe)
\(M\)    Number of DE genes (your list)
\(k\)    Number of genes in a category (GO term)
\(n\)    Number of DE genes found in that category

We test if \(n\) is greater than expected by chance.

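The hypergeometric distribution gives the probability of observing exactly \(n\) DE genes in a category of size \(k\), given the total number of genes \(N\) and the number of DE genes \(M\): \[ P(X = n) = \frac{\binom{k}{n} \, \binom{N-k}{M-n}}{\binom{N}{M}} \]

The p-value for over-representation is the probability of observing \(n\) or more DE genes in the category, i.e. the upper tail of this distribution: \[ P(X \ge n) = \sum_{i=n}^{\min(k, M)} \frac{\binom{k}{i} \, \binom{N-k}{M-i}}{\binom{N}{M}} \]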

Example in R

# Define parameters
N <- 20000   # Total number of genes
k <- 150     # Genes in the GO term
M <- 2000    # Total number of DE genes
n <- 30      # DE genes in the GO term

# Calculate p-value using hypergeometric test
p_value <- phyper(q = n - 1,      # q = x-1 for upper tail
                  m = k,          # number of white balls in population
                  n = N - k,      # number of black balls in population
                  k = M,          # number of draws
                  lower.tail = FALSE)

print(p_value)
[1] 0.0001681513

This p-value indicates the probability of observing 30 or more DE genes in the GO term by chance. A low p-value (e.g., < 0.05) suggests that the GO term is significantly over-represented among the DE genes.
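
As a sanity check, the same upper-tail probability can be obtained in two other ways in base R: by summing the point probabilities from `dhyper()`, and with a one-sided Fisher's exact test on the corresponding 2 x 2 table. The variable names below are chosen for this sketch only.

```r
N <- 20000; k <- 150; M <- 2000; n <- 30

# Upper tail via the cumulative distribution function.
p_phyper <- phyper(n - 1, k, N - k, M, lower.tail = FALSE)

# Same tail as an explicit sum of point probabilities P(X = i), i = n, n+1, ...
p_sum <- sum(dhyper(n:min(k, M), k, N - k, M))

# One-sided Fisher's exact test on the 2 x 2 table
#              in term    not in term
#   DE            n          M - n
#   not DE      k - n    N - M - (k - n)
tab <- matrix(c(n, k - n, M - n, N - M - (k - n)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(phyper = p_phyper, dhyper_sum = p_sum, fisher = p_fisher)
```

All three values should agree up to floating-point precision, which is why the hypergeometric test and the one-sided Fisher's exact test are used interchangeably for ORA.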

Note

Remember to adjust for multiple testing when evaluating p-values across many GO terms.

clusterProfiler

  • Comprehensive R package for functional enrichment analysis
  • Supports GO, KEGG, Reactome, Disease Ontology, and custom gene sets
  • Implements ORA and GSEA methods
  • Provides visualization functions (dot plots, bar plots, enrichment maps)
  • Integrates with other Bioconductor packages (e.g., DESeq2, edgeR)

Example: ORA with clusterProfiler

# Load necessary libraries
library(clusterProfiler)
library(org.Mm.eg.db)
library(ggplot2)
library(dplyr)
library(DESeq2)
library(qs)

set.seed(420)

# Load the DESeq2 object and extract the results table once
dds <- qread("dds_all_genotype.qs")
res <- results(dds) |> as.data.frame() |>
    filter(!is.na(padj))

# Background (universe): genes that passed DESeq2's filtering
# and received an adjusted p-value
uni_genes <- rownames(res)

# Significantly up- and downregulated genes (padj <= 0.05)
de_genes_up <- res |>
    filter(padj <= 0.05, log2FoldChange > 0) |>
    rownames()
de_genes_dn <- res |>
    filter(padj <= 0.05, log2FoldChange < 0) |>
    rownames()

ego_result_up <- enrichGO(gene = de_genes_up,
                       universe = uni_genes,
                       OrgDb = org.Mm.eg.db,
                       keyType = "SYMBOL",
                       ont = "BP",
                       pAdjustMethod = "BH",
                       pvalueCutoff = 0.05,
                       qvalueCutoff = 0.2,
                       readable = TRUE)
# View results
#DT::datatable(head(ego_result_up))

# Visualize results
barplot(ego_result_up, font.size=14) + 
    ggtitle("GO Enrichment Analysis - Upregulated Genes : Barplot")

Visualisation with enrichplot - Dotplot

dotplot(ego_result_up, showCategory=10, font.size=14) + 
    ggtitle("GO Enrichment Analysis - Upregulated Genes : Dotplot")

Visualisation with enrichplot - Enrichment Map

# Add similarity matrix to the termsim slot of enrichment result
ego_result_up <- enrichplot::pairwise_termsim(ego_result_up)

# Enrichmap clusters the 30 most significant (by padj) GO terms
# to visualize relationships between terms
emapplot(ego_result_up, showCategory = 30) + 
    ggtitle("GO Enrichment Analysis - Upregulated Genes : Enrichment Map")

Visualisation with enrichplot - Cnetplot

cnetplot(ego_result_up, showCategory = 20, foldChange = NULL) + 
    ggtitle("GO Enrichment Analysis - Upregulated Genes : Cnetplot")

Things to Remember

Directional Gene Lists

  • Separate upregulated and downregulated genes
  • Perform enrichment analysis on each list separately
  • Compare enriched categories between up and downregulated genes
  • Provides insights into distinct biological processes affected

One search (up and down combined):

  • Higher power (more genes)
  • Lower enrichment
  • Mixed effects (pathways)

Two searches (up and down separately):

  • Easier interpretation
  • Less power
  • Higher enrichment

In general:

  • Consider the biological context and research question when choosing an approach
  • Using an appropriate background list (e.g., only the genes that were actually tested) can make a huge difference
  • Validate findings with additional experiments or literature review
  • Combine with other analyses (e.g., network analysis) for deeper insights
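
The effect of the background list can be illustrated with the hypergeometric example from earlier. The numbers for the restricted background (12000 tested genes, 120 of them in the category) are assumptions made up for this sketch, not values from the course data.

```r
n <- 30     # DE genes in the category
M <- 2000   # DE genes in total

# Background A: every annotated gene in the genome
# (20000 genes, 150 of them in the category).
p_genome <- phyper(n - 1, 150, 20000 - 150, M, lower.tail = FALSE)

# Background B: only genes that were expressed and tested
# (assumed 12000 genes, 120 of them in the category).
p_tested <- phyper(n - 1, 120, 12000 - 120, M, lower.tail = FALSE)

c(genome = p_genome, tested = p_tested)
```

With the smaller background, 20 DE genes are expected in the category by chance instead of 15, so the same observation of 30 genes yields a larger p-value. A genome-wide background can therefore overstate enrichment when many genes were never detectable in the experiment.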